Estimation of the mean and standard deviation using the closest two of three observations in a
sample from a normal population with contamination by slippage of the mean is investigated by a
sampling study. Lieblein's results, which indicated that the use of these statistics is not advisable for
noncontaminated samples, are borne out by this study for contaminated samples as well.

1. Introduction

In the physical sciences samples of only three measurements of a quantity are not uncommon,
and estimation of the actual value of the quantity from these few measurements poses some
difficult problems. If it is known that all of the measurements are good, that is, that they are
measurements of the same quantity and that they contain no blunders or gross errors, then the
sample mean has no serious competitor as an estimate of the true mean for measurement data
which are approximately normally distributed.

Quite often, though, there is a definite possibility that one or more of the measurements
contain an error which is not just due to the uncertainties of the measurement process, but
which results from a slip in the procedure, a failure of some component of the measurement
apparatus, a misread dial, etc. Such a measurement is called a contaminant and sometimes,
depending on the purpose of the experiment, should be discarded from the sample. Unfortunately,
unless the error is very large it is usually difficult to determine whether a measurement is a
contaminant or not. The chemistry lab teacher who advises his students to take three
measurements and use only the closest two of them in their calculations has recognized this
problem and uses this device in an attempt to get robust estimates which are not likely to be
as affected by a contaminant as the ordinary estimates are. The main purpose of this paper is
to examine how sound this procedure is.

Lieblein [1955] derived distributions of some statistics, especially the mean and the range of
the closest two of three independent observations from the same normal distribution, and he
discussed their properties as estimators of the mean and the standard deviation of the
population. He found them to be inefficient and generally unreliable compared to the mean and
the range of all three observations. However, an experimenter would use the closest two of
three observations to compute his estimates only if he thought that the sample might be
contaminated. Lieblein considered the null case where no contamination exists; hence he has
evaluated the penalty one must pay by using these protected estimates when they are not really
needed. To complete the picture we must find out how much, if any, the experimenter stands to
gain if there is contamination, and thus see how robust these estimates really are. This, of
course, depends on how much and what kind of contamination is present, and this note describes
the results of a sampling experiment in the important case where the contamination is by
slippage of the mean. It is assumed throughout that the standard deviation is not known. If
some prior knowledge of the standard deviation is available the contaminants are easier to
detect and the treatment of the problem is changed.

2. Estimation of the Mean With Exactly One 
Contaminant 

Let $x_1, x_2, x_3$ be an independent sample of size three, with $x_1$ and $x_2$ from a normal
distribution with mean $\mu$ and variance $\sigma^2$. Let $x_3$ be from a normal distribution
with mean $\mu + \delta\sigma$ and variance $\sigma^2$. The order statistics for the sample will
be denoted $x_{(1)} < x_{(2)} < x_{(3)}$. Let $x'$ and $x''$ be the closest two of the three
observations, with $x' < x''$. Then consider the following statistics as estimates of the mean:

$$\bar{x} = \tfrac{1}{3} \sum_{i=1}^{3} x_i,$$

$$m = x_{(2)},$$

$$y_3 = \tfrac{1}{2}(x' + x'').$$

In particular we are interested in (1) the bias, or the difference between the expected value
of the statistic and $\mu$, and (2) the root mean square error of the estimate from $\mu$. In
the null case, $\delta = 0$, all three estimates are unbiased and the root mean square errors
of $\bar{x}$ and $m$ are known to be

$$\{E[(\bar{x}-\mu)^2]\}^{1/2} = \frac{\sigma}{\sqrt{3}} = 0.577\sigma, \quad \text{and}$$

$$\{E[(m-\mu)^2]\}^{1/2} = 0.670\sigma;$$

and that of $y_3$ was found by Lieblein to be

$$\{E[(y_3-\mu)^2]\}^{1/2} = \left(\frac{1}{2} + \frac{\sqrt{3}}{4\pi}\right)^{1/2}\sigma = 0.799\sigma.$$



Thus, in the null case, the sample mean is about twice as efficient as $y_3$. It should be
noted that $y_3$ has a larger standard error than the mean of an uncontaminated sample of size
two has ($0.707\sigma$), so that when there is no contamination the chemistry student is better
off taking just two measurements and averaging them than he is taking three and using the best
two of the three.
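The comparison can be checked from the quoted root mean square errors alone. The worked ratios below are a sketch that takes "efficiency" to mean the ratio of mean square errors, which is an assumption about the intended definition.

```latex
\[
\left(\frac{0.799\,\sigma}{0.577\,\sigma}\right)^{2}
  \approx \frac{0.638}{0.333} \approx 1.92,
\qquad
\left(\frac{0.799\,\sigma}{0.707\,\sigma}\right)^{2}
  \approx \frac{0.638}{0.500} \approx 1.28 .
\]
```

On this reading the mean square error of $y_3$ is a little under twice that of the mean of all three observations, and about 28 percent larger than that of the mean of two uncontaminated observations.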

In the sampling experiment 1,000 samples of size three (containing two uncontaminated values,
$x_1$ and $x_2$, and one contaminated value, $x_3$) were taken for each of the values
$\delta = 0, 1, 2, 3, 4,$ and $6$, to see how the above estimators performed in nonnull
situations. The bias and the root mean square error of these estimates as determined by the
sampling experiment are graphed in figure 1. (The results of the sampling experiment agree with
the exact values, which are known for $\delta = 0$ and for $\bar{x}$, almost to within the
accuracy of the graphs. The graphs for the median, $m$, are exact values calculated from its
probability distribution function.)
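A sampling experiment of this kind can be sketched in a few lines of Python. This is not the original program: the function names, the random number generator, the seed, and the choice $\mu = 0$, $\sigma = 1$ are all assumptions made for illustration, and only the standard library is used.

```python
import math
import random


def closest_two_mean(a, b, c):
    """Average of the two closest of three observations (y_3 in the text)."""
    lo, mid, hi = sorted((a, b, c))
    # Drop whichever end point lies farther from the middle value.
    if mid - lo <= hi - mid:
        return 0.5 * (lo + mid)
    return 0.5 * (mid + hi)


def simulate_one_contaminant(delta, n_samples=1000, mu=0.0, sigma=1.0, seed=0):
    """Return {estimator: (bias, rmse)} when x3 has mean mu + delta * sigma."""
    rng = random.Random(seed)
    totals = {"xbar": [0.0, 0.0], "m": [0.0, 0.0], "y3": [0.0, 0.0]}
    for _ in range(n_samples):
        x1 = rng.gauss(mu, sigma)
        x2 = rng.gauss(mu, sigma)
        x3 = rng.gauss(mu + delta * sigma, sigma)  # the single contaminant
        estimates = {
            "xbar": (x1 + x2 + x3) / 3.0,
            "m": sorted((x1, x2, x3))[1],
            "y3": closest_two_mean(x1, x2, x3),
        }
        for name, value in estimates.items():
            error = value - mu
            totals[name][0] += error          # accumulates the bias
            totals[name][1] += error * error  # accumulates the squared error
    return {
        name: (s / n_samples, math.sqrt(q / n_samples))
        for name, (s, q) in totals.items()
    }


# Same grid of delta values and sample count as described in the text.
for delta in (0, 1, 2, 3, 4, 6):
    results = simulate_one_contaminant(delta)
    print(delta, {k: (round(b, 2), round(r, 2)) for k, (b, r) in results.items()})
```

With only 1,000 samples per value of $\delta$ the printed figures carry visible sampling noise; a larger `n_samples` brings the null-case root mean square errors close to the values quoted above.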

Figure 1. Exactly one contaminant.

Figure 2. 5% contamination.

Figure 3. 20% contamination.

Figure 4. Exactly one contaminant. Means and root mean square errors for two estimates of the
standard deviation.

The three lines in the graphs labeled $\eta = 1/3$, $\eta = 1/6$, and $\eta = 1/11$ correspond
to estimates of the mean where either $\bar{x}$ or $y_3$ is used, depending on the spacing of
the three measurements, as follows. If the spacing between the three measurements is about the
same, then there is little evidence of a contaminant and one would want to use $\bar{x}$.
However, if one of the three is relatively far removed from the other two, then the natural
tendency is to discard it and use the average of the other two, namely $y_3$. A decision rule
for this policy can be formulated as follows. Let

$$\omega = x_{(3)} - x_{(1)}, \qquad y_2 = x'' - x'.$$

Then, if $y_2/\omega < \eta$, where $\eta$ is a preassigned constant, use $y_3$ as the
estimate, and if $y_2/\omega \ge \eta$ use $\bar{x}$. For example, suppose $\eta$ is chosen to
be $1/11$; then if the measurement on one end is more than ten times as far from the middle
measurement as the one on the other end, $y_3$ is used; otherwise $\bar{x}$ is used. Notice
that $\eta = 0$ corresponds to using $\bar{x}$ always and $\eta = 1/2$ corresponds to using
$y_3$ always.
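The decision rule can be written as a small function. The sketch below is illustrative only: the name `hybrid_estimate` is hypothetical, and the guard for three identical readings is an added assumption, since the text does not discuss that case.

```python
def hybrid_estimate(a, b, c, eta):
    """Use the closest-two average when y2/omega < eta, else the sample mean."""
    lo, mid, hi = sorted((a, b, c))
    omega = hi - lo                  # range of all three observations
    y2 = min(mid - lo, hi - mid)     # range of the closest two
    xbar = (a + b + c) / 3.0
    if omega == 0.0:
        # Assumption: with three identical readings there is nothing to discard.
        return xbar
    if y2 / omega < eta:
        # One end point is relatively far away: average the closest pair.
        if mid - lo <= hi - mid:
            return 0.5 * (lo + mid)
        return 0.5 * (mid + hi)
    return xbar


# One reading far from the other two: ratio 0.05 is below 1/11.
print(hybrid_estimate(10.0, 10.1, 12.0, eta=1 / 11))
# More even spacing: ratio 0.25 is not below 1/11, so the mean is used.
print(hybrid_estimate(10.0, 10.5, 12.0, eta=1 / 11))
```

Because $y_2/\omega$ can never exceed $1/2$, setting `eta=0.5` selects the closest pair for every sample except one with exactly equal spacing, and `eta=0` always returns the mean.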

The bias in $\bar{x}$ increases linearly with $\delta$, whereas that of $y_3$ goes to 0, since
the two good measurements are almost always used for $y_3$ when $\delta$ is large. For the same
reason the bias in the median levels out to $0.564\sigma$, the expected value of the second
order statistic in a sample of size two.
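Both statements follow from short calculations. The sketch below assumes the single-contaminant model of this section, in which two observations have mean $\mu$ and the third has mean $\mu + \delta\sigma$, and uses the standard result for the expected larger of two independent normal observations.

```latex
\[
E[\bar{x}] - \mu
  = \tfrac{1}{3}\bigl(0 + 0 + \delta\sigma\bigr)
  = \frac{\delta\sigma}{3},
\qquad
E\bigl[\max(x_1, x_2)\bigr] - \mu
  = \frac{\sigma}{\sqrt{\pi}} \approx 0.564\,\sigma .
\]
```

When $\delta$ is large the contaminant is almost always the largest observation, so the median of the three is almost always the larger of the two good measurements, which gives the limiting value on the right.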

The graph of the root mean square error shows that although in the null case $y_3$ is quite
inefficient, it results in a real saving for large values of $\delta$, and this fact would seem
to support the opinion of the chemistry teacher.



3. Estimation of the Mean With a Random 
Number of Contaminants 

The graphs discussed above can be misleading since they are for a model in which there is known
to be exactly one contaminant. In practice the difficult task usually is to decide whether
there is a contaminant present. Even in noncontaminated samples the ratio $y_2/\omega$ can be
deceptively small (the probability that $y_2/\omega \le 1/11$ is about 0.157). Someone who does
not fully appreciate the vagaries of small samples can easily be led to believe there is a
contaminant present when there is none.
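The quoted probability is easy to check by simulation. The following sketch draws uncontaminated samples of three standard normal values and counts how often the spacing ratio falls at or below the threshold; the function names, sample count, and seed are assumptions chosen for illustration.

```python
import random


def spacing_ratio(a, b, c):
    """Range of the closest two divided by the range of all three."""
    lo, mid, hi = sorted((a, b, c))
    # Ties have probability zero for continuous data, so hi > lo is assumed.
    return min(mid - lo, hi - mid) / (hi - lo)


def prob_ratio_at_most(eta, n_samples=200_000, seed=1):
    """Monte Carlo estimate of P(y2/omega <= eta) with no contamination."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_samples):
        sample = [rng.gauss(0.0, 1.0) for _ in range(3)]
        if spacing_ratio(*sample) <= eta:
            hits += 1
    return hits / n_samples


# Should land close to the 0.157 quoted in the text.
print(prob_ratio_at_most(1 / 11))
```

In other words, roughly one clean sample in six looks as though it contains an outlier by the "ten times as far" criterion.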

Perhaps a more realistic model would be to assume that any one of the measurements has a
certain chance of being contaminated; hence there could be 0, 1, 2, or even 3 contaminants in
the sample. If there are two or three contaminants in a sample of three then estimation of the
mean is hopeless anyway, but it is possible that the use of $y_3$ leads to a false sense of
security, or does even more damage than $\bar{x}$ or $m$, particularly when all contaminants
are from the same source.

In figures 2 and 3 are graphed the bias and the root mean square error for these estimates when
there is a 5 percent and a 20 percent chance, respectively, that each particular measurement is
a contaminant and when this probability is independent of the probability for the other two
measurements. The graphs were calculated from the results of the sampling experiment above by
the use of binomial probabilities. For this model $y_3$ loses much of its advantage, especially
for the high contamination of 20 percent, because there is a reasonable chance that two of the
sample values are contaminants and then $y_3$ exhibits an even larger bias than $\bar{x}$.
Moreover, $\bar{x}$ has a uniformly smaller root mean square error than either $y_3$ or $m$ has
up to $\delta = 6$. This, of course, is true only because in this model all contaminants have
their mean displaced in the same direction. If contamination comes from both sides the bias may
be eliminated.
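The random-contamination model can be illustrated in two ways: by the binomial weights for 0, 1, 2, or 3 contaminants, and by direct simulation. The sketch below does both; it simulates the model directly rather than reweighting the single-contaminant results as the paper did, and the function names, seed, and the choice $\mu = 0$, $\sigma = 1$ are assumptions.

```python
import math
import random


def binomial_weights(p, n=3):
    """Probabilities of 0, 1, ..., n contaminants among n measurements."""
    return [math.comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(n + 1)]


def closest_two_mean(a, b, c):
    """Average of the two closest of three observations."""
    lo, mid, hi = sorted((a, b, c))
    return 0.5 * (lo + mid) if mid - lo <= hi - mid else 0.5 * (mid + hi)


def simulate_random_contamination(p, delta, n_samples=20000, seed=0):
    """Return {estimator: (bias, rmse)} taking mu = 0 and sigma = 1."""
    rng = random.Random(seed)
    totals = {"xbar": [0.0, 0.0], "y3": [0.0, 0.0]}
    for _ in range(n_samples):
        # Each value is shifted by delta with probability p, independently.
        x = [rng.gauss(delta if rng.random() < p else 0.0, 1.0) for _ in range(3)]
        for name, value in (("xbar", sum(x) / 3.0), ("y3", closest_two_mean(*x))):
            totals[name][0] += value          # error equals value since mu = 0
            totals[name][1] += value * value
    return {
        name: (s / n_samples, math.sqrt(q / n_samples))
        for name, (s, q) in totals.items()
    }


print([round(w, 4) for w in binomial_weights(0.05)])
print([round(w, 4) for w in binomial_weights(0.20)])
print(simulate_random_contamination(0.20, 3))
```

At 20 percent contamination the weights show that about one sample in ten contains two or more contaminants, which is the situation in which averaging the closest pair tends to select the contaminated values.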

The graphs for the optional estimates with $\eta = 1/3$, $1/6$, and $1/11$ are not given, but
they lie midway between the graphs of $y_3$ and $\bar{x}$ in that respective order.



4. Estimation of the Standard Deviation 

The same sampling experiment was used to evaluate properties of different estimators of the
standard deviation, $\sigma$. Two estimators were considered,

$$\hat{\sigma} = 0.591\,\omega = 0.591\,(x_{(3)} - x_{(1)}),$$

$$\tilde{\sigma} = 2.205\,y_2 = 2.205\,(x'' - x').$$

The estimate based on the range, $\omega$, is the usual estimate for very small sample sizes,
but again, if contamination is feared, one might want to use $\tilde{\sigma}$, the estimate
based on the range of the closest two. The factors, 0.591 and 2.205, make the estimates
unbiased in the null case. Lieblein [1955] has shown that $\tilde{\sigma}$ is quite inefficient
compared to $\hat{\sigma}$ in the null case; in fact

$$\{E(\hat{\sigma}-\sigma)^2\}^{1/2} = 0.524\sigma,$$

$$\{E(\tilde{\sigma}-\sigma)^2\}^{1/2} = 0.826\sigma.$$

Here, just as for the mean, the range of just two true duplicates provides a better estimate
than the closest-two estimate $\tilde{\sigma}$.

The bias and the root mean square error of the range estimate $\hat{\sigma}$ and the
closest-two estimate $\tilde{\sigma}$, as determined from the sampling experiment, are graphed
in units of $\sigma$ in figure 4. These graphs are for the model in which there is exactly one
contaminant in the sample. Graphs for the $100\gamma$ percent contamination model
(corresponding to figures 2 and 3 for the mean) are not included in this note, but they also
indicate the superiority of $\hat{\sigma}$ over $\tilde{\sigma}$ when all contaminants come
from the same source and $\delta \le 3$.
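The behavior of the two estimates of the standard deviation can be explored with the same kind of simulation. The sketch below applies the factors 0.591 and 2.205 from the text to the range of all three observations and to the range of the closest two; the function names, seed, and the choice $\sigma = 1$ are assumptions for illustration.

```python
import math
import random


def sigma_estimates(a, b, c):
    """Return (0.591 * range of all three, 2.205 * range of the closest two)."""
    lo, mid, hi = sorted((a, b, c))
    omega = hi - lo
    y2 = min(mid - lo, hi - mid)
    return 0.591 * omega, 2.205 * y2


def simulate_sigma(delta, n_samples=1000, seed=0):
    """Return ((mean, rmse) for the range estimate, (mean, rmse) for the
    closest-two estimate), with true sigma = 1 and one value shifted by delta."""
    rng = random.Random(seed)
    sums = [0.0, 0.0]
    squares = [0.0, 0.0]
    for _ in range(n_samples):
        sample = (rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0), rng.gauss(delta, 1.0))
        for i, estimate in enumerate(sigma_estimates(*sample)):
            sums[i] += estimate
            squares[i] += (estimate - 1.0) ** 2  # squared error about sigma = 1
    return tuple(
        (sums[i] / n_samples, math.sqrt(squares[i] / n_samples)) for i in range(2)
    )


for delta in (0, 1, 2, 3, 4, 6):
    range_based, closest_two = simulate_sigma(delta)
    print(delta, [round(v, 2) for v in range_based], [round(v, 2) for v in closest_two])
```

With a large `n_samples` the null-case means come out close to 1 and the root mean square errors close to the 0.524 and 0.826 quoted above.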